Abstract

A method for the simultaneous localization and recognition
of dynamic hand gestures is proposed. At the core of this
method is a dynamic space-time warping (DSTW) algorithm
that aligns a pair of query and model gestures in both space
and time. For every frame of the query sequence, feature
detectors generate multiple hand region candidates. Dynamic
programming is then used to compute both a global matching
cost, which is used to recognize the query gesture, and a
warping path, which aligns the query and model sequences in
time and also finds the best hand candidate region in every
query frame. The proposed framework includes translation
invariant recognition of gestures, a desirable property for
many HCI systems. The performance of the approach is
evaluated on a dataset of hand-signed digits gestured by
people wearing short sleeve shirts, in front of a background
containing other non-hand skin-colored objects. The algorithm
simultaneously localizes the gesturing hand and recognizes
the hand-signed digit. Although DSTW is illustrated in a
gesture recognition setting, the proposed algorithm is a
general method for matching time series that allows for
multiple candidate feature vectors to be extracted at each
time step.

1.  Introduction 

Hand  gestures  are  an  important  modality  for  human  com¬ 
puter  interaction  (HCI)  [15].  Compared  to  many  existing 
interfaces,  hand  gestures  have  the  advantages  of  being  easy 
to  use,  natural,  and  intuitive.  Successful  applications  of 
hand  gesture  recognition  include  computer  games  control 
[7],  human-robot  interaction  [22],  and  sign  language  recog¬ 
nition  [21],  to  name  a  few.  Vision-based  recognition  sys¬ 
tems  can  give  computers  the  capability  of  understanding 
and  responding  to  hand  gestures.  The  usability  of  such  sys¬ 
tems  greatly  depends  on  their  ability  to  function  reliably 
in  common  real-world  environments,  without  requiring  the 
user  to  wear  special  clothes  or  cumbersome  devices  such  as 
colored  markers  or  gloves  [22] . 


*This research was funded in part by NSF grants CNS-0202067, IIS-

Most hand gesture recognition systems assume that the
gesturing  hand  can  be  reliably  located  in  every  frame  of  the 
input  sequence.  In  many  real  life  settings  this  assumption 
cannot  be  satisfied.  For  example,  in  Figure  1  skin  detec¬ 
tion  yields  multiple  hand  candidates,  and  the  top  candidate 
is  often  not  correct.  Other  visual  cues  commonly  used  for 
hand  detection  such  as  motion,  edges,  and  background  sub¬ 
traction  [2,  13]  would  also  fail  to  unambiguously  locate  the 
hand  in  the  image.  Motion-based  detection  and  background 
subtraction  may  fail  to  uniquely  identify  the  location  of  the 
hand  when  the  face,  non-gesturing  hand  or  other  scene  ob¬ 
jects  are  moving.  At  the  same  time,  such  methods  can  be 
used  to  produce  a  relatively  short  list  of  candidate  hand  lo¬ 
cations. 

The  proposed  approach  is  a  principled  method  for  ges¬ 
ture  recognition  in  domains  where  existing  algorithms  can¬ 
not  reliably  localize  the  gesturing  hand.  Instead  of  assuming 
perfect  hand  detection,  we  make  the  milder  assumption  that 
a  list  of  candidate  hand  locations  is  available  for  each  frame 
of  the  input  sequence.  At  the  core  of  our  framework  is  a  dy¬ 
namic  space-time  warping  (DSTW)  algorithm,  that  aligns  a 
pair  of  query  and  model  gestures  in  time,  while  at  the  same 
time  it  identifies  the  best  hand  location  out  of  the  multiple 
hypotheses  available  at  each  query  frame.  The  main  advan¬ 
tages  of  our  method  are  the  following: 

•  Hand  detection  is  not  merely  a  bottom-up  procedure. 
The  gesture  model  is  used  to  select  hand  locations  in 
a  way  that  the  query-to-model  matching  cost  is  opti¬ 
mized. 

•  Recognition  can  be  achieved  even  in  the  presence  of 
multiple  “distractors,”  like  moving  objects,  or  skin- 
colored  objects  (e.g.,  face,  non-gesturing  hand,  back¬ 
ground  objects). 

•  Recognition  is  robust  to  overlaps  between  the  gestur¬ 
ing  hand  and  the  face  or  the  other  hand. 

•  Recognition  is  translation-invariant;  the  gesture  can 
occur  in  any  part  of  the  image. 

•  Unlike HMMs and CONDENSATION-based gesture
recognition, our method requires no knowledge of
observation and transition densities, and therefore can be
applied even if we have a single example per class.

•  Although this paper describes DSTW in the context of
gesture localization and recognition, DSTW is a general
method for matching time series that can accommodate
multiple candidate feature vectors at each time step.

Figure 1: Detection of candidate hand regions based on skin
color. Clearly, skin color is not sufficient to unambiguously
detect the gesturing hand since the face, the non-gesturing
hand, and other objects in the scene have similar color. On
the other hand, for this particular scene, the gesturing hand
is consistently among the top 15 candidates identified by
skin detection.

Inspired  by  previous  vision-based  HCI  systems  (e.g.,  the 
virtual  whiteboard  by  Black  and  Jepson  [1],  and  the  virtual 
drawing  package  by  Isard  [9],  to  name  a  few),  we  evaluate 
our  framework  on  a  vision-based  character  recognition  task. 

2.  Related  Work 

In  most  dynamic  gesture  recognition  systems  (e.g.,  [4,  21]) 
information  flows  bottom-up:  the  video  is  input  into  the 
analysis  module,  which  estimates  the  hand  pose  and  shape 
model  parameters,  and  these  parameters  are  in  turn  fed  into 
the  recognition  module,  which  classifies  the  gesture  [15].  In 
a  bottom-up  framework,  tracking  and  recognition  typically 
fail  in  the  absence  of  perfect  hand  segmentation. 

The  method  proposed  in  this  paper  is  an  extension  of  Dy¬ 
namic  Time  Warping  (DTW).  DTW  was  originally  intended 
to  recognize  spoken  words  of  small  vocabulary  [12,  16].  It 
was  also  applied  successfully  to  recognize  a  small  vocab¬ 
ulary  of  gestures  [3,  5].  The  DTW  algorithm  temporally 
aligns  two  sequences,  a  query  sequence  and  a  model  se¬ 
quence,  and  computes  a  matching  score,  which  is  used  for 
classifying  the  query  sequence.  The  time  complexity  of  the 
basic  DTW  algorithm  is  quadratic  in  the  sequence  length, 
but more efficient variants have been proposed [19, 11]. In
DTW,  it  is  assumed  that  a  feature  vector  can  be  reliably  ex¬ 
tracted  from  each  query  frame.  However,  this  assumption 
is  often  hard  to  satisfy  in  vision-based  systems,  where  the 
gesturing  hand  cannot  be  located  with  absolute  confidence. 


A  framework  that  allows  for  multiple  detections  of  candi¬ 
date  hand  regions,  or  more  generally  multiple  observations, 
is  therefore  required. 

In  multiple  hypothesis  tracking  (e.g.,  [17])  multiple  hy¬ 
potheses  are  associated  with  multiple  observations.  Each 
observation  corresponds  to  a  different  object  with  a  differ¬ 
ent  model.  In  contrast,  in  the  proposed  framework  a  single 
consistent  hypothesis  is  selected  among  multiple  distinct 
observations  (detections),  only  one  of  which  is  correct.  The 
CONDENSATION-based  framework  can  also  be  applied  to 
gesture  recognition  [1].  Although  in  principle  CONDEN¬ 
SATION  can  be  used  for  both  tracking  and  recognition, 
in  [1]  CONDENSATION  was  only  used  for  the  recogni¬ 
tion  part,  once  the  trajectory  had  been  reliably  estimated 
using  a  color  marker.  Even  given  the  trajectory,  system  per¬ 
formance  was  reported  to  be  significantly  slower  than  real 
time,  due  to  the  large  number  of  hypotheses  that  needed  to 
be  evaluated  and  propagated  at  each  frame.  Also,  to  use 
CONDENSATION  we  need  to  know  the  observation  den¬ 
sity  and  propagation  density  for  each  state  of  each  class 
model,  whereas  in  our  method  no  such  knowledge  is  nec¬ 
essary. 

The  work  by  Sato  and  Kobayashi  [20]  is  the  most  re¬ 
lated  to  our  work.  In  the  Hidden  Markov  Model  (HMM) 
framework,  Sato  and  Kobayashi  extended  the  Viterbi  algo¬ 
rithm  so  that  multiple  candidate  observations  can  be  accom¬ 
modated  at  each  query  frame;  the  optimal  state  sequence 
is  constrained  to  pass  through  the  most  likely  candidate  at 
every  time  step.  HMMs  have  found  wider  application  for 
problems  with  large  vocabulary  (of  words  or  gestures)  pri¬ 
marily  due  to  their  ability  to  probabilistically  encode  the 
variability  of  the  training  data.  However,  DTW  can  still 
be  appropriate  for  smaller  problems  because  it  is  simpler 
to  implement:  there  is  no  need  to  worry  about  the  HMM 
structure,  and  no  training  is  required.  Furthermore,  our  ap¬ 
proach  differs  from  [20]  in  that  it  incorporates  translation 
invariance,  and  is  evaluated  in  a  more  challenging  setting 
(users  are  wearing  short  sleeve  shirts  and  the  hand  is  not  an 
isolated  skin-colored  blob). 

3.  Detection  and  Feature  Extraction 

The  overall  algorithm  consists  of  three  major  components: 
detection  of  multiple  candidate  hand  regions,  feature  extrac¬ 
tion,  and  hand  gesture  recognition. 

3.1.  Detection 

The  proposed  method  has  been  designed  to  accommodate 
multiple  hypotheses  for  the  hand  location  in  each  frame. 
Therefore, we can afford to use a relatively simple and
efficient hand detection scheme. In our implementation we
combine two visual cues, color and motion, both of which
require only a few operations per pixel.


The  skin  detector  first  computes  for  every  image  pixel  a 
skin  likelihood  term.  For  the  first  frames  of  the  sequence, 
where  a  face  has  still  not  been  detected,  we  use  a  generic 
skin  color  histogram  [10]  to  compute  the  skin  likelihood 
image.  Once  a  face  has  been  detected  [18],  we  use  the 
mean  and  covariance  of  the  face  skin  pixels  in  normalized 
rg  space  to  compute  the  skin  likelihood  image. 
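
As an illustration of the face-adapted skin model, the following minimal Python sketch fits the mean and covariance of face pixels in normalized rg space and scores a pixel against them. The bivariate Gaussian form of the likelihood, the small ridge term `eps`, and the function names `to_rg`, `fit_gaussian`, and `skin_likelihood` are assumptions made for illustration; the text above only states that the mean and covariance are used.

```python
import math

def to_rg(pixel):
    """Map an (R, G, B) pixel to normalized rg chromaticity."""
    r, g, b = pixel
    total = r + g + b
    if total == 0:
        return (1.0 / 3.0, 1.0 / 3.0)  # convention for black pixels (assumption)
    return (r / total, g / total)

def fit_gaussian(face_pixels):
    """Mean and 2x2 covariance of the face pixels in rg space."""
    pts = [to_rg(p) for p in face_pixels]
    n = len(pts)
    mr = sum(p[0] for p in pts) / n
    mg = sum(p[1] for p in pts) / n
    srr = sum((p[0] - mr) ** 2 for p in pts) / n
    sgg = sum((p[1] - mg) ** 2 for p in pts) / n
    srg = sum((p[0] - mr) * (p[1] - mg) for p in pts) / n
    eps = 1e-6  # small ridge keeps the covariance invertible (assumption)
    return (mr, mg), ((srr + eps, srg), (srg, sgg + eps))

def skin_likelihood(pixel, mean, cov):
    """Bivariate Gaussian density of the pixel's rg value (assumed model)."""
    r, g = to_rg(pixel)
    dr, dg = r - mean[0], g - mean[1]
    (a, b), (_, d) = cov
    det = a * d - b * b
    # Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    maha = (d * dr * dr - 2.0 * b * dr * dg + a * dg * dg) / det
    return math.exp(-0.5 * maha) / (2.0 * math.pi * math.sqrt(det))

# Hypothetical face samples; a skin-like pixel should score above a blue one.
face = [(200, 150, 120), (210, 160, 130), (190, 140, 110), (205, 150, 125)]
mean, cov = fit_gaussian(face)
skin_score = skin_likelihood((200, 150, 120), mean, cov)
blue_score = skin_likelihood((20, 40, 220), mean, cov)
```

Evaluating `skin_likelihood` at every pixel would give a skin likelihood image of the kind described above.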

The  motion  detector  computes  a  mask  by  thresholding 
the  result  of  frame  differencing.  If  there  is  significant  mo¬ 
tion  between  the  previous  and  current  frame  the  motion 
mask  is  applied  to  the  skin  likelihood  image  to  obtain  the 
hand  likelihood  image.  Using  the  integral  image  [23]  of 
the  hand  likelihood  image,  we  efficiently  compute  for  every 
subwindow  of  some  predetermined  size  the  sum  of  pixel 
likelihoods in that subwindow. Then we extract the K
subwindows with the highest sum, such that none of the K
subwindows may include the center of another of the K
subwindows. If there is no significant motion between the
previous and current frame, then the previous K subwindows
are copied over to the current frame.
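
The candidate extraction step can be sketched in a few lines of Python. The sketch below builds an integral image, scores every subwindow of a fixed size in constant time, and then greedily keeps the best-scoring windows whose centers do not fall inside an already selected window. The greedy order, the handling of boundary cases, and the function names are assumptions for illustration; the text does not specify how ties or the exclusion rule are implemented.

```python
def integral_image(img):
    """Summed-area table with an extra zero row and column."""
    h, w = len(img), len(img[0])
    ii = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0.0
        for x in range(w):
            row_sum += img[y][x]
            ii[y + 1][x + 1] = ii[y][x + 1] + row_sum
    return ii

def window_sum(ii, top, left, win_h, win_w):
    """Sum of the win_h x win_w window with top-left corner (top, left)."""
    bottom, right = top + win_h, left + win_w
    return ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]

def top_k_windows(img, win_h, win_w, k):
    """Greedily pick up to k windows, highest sum first, such that no
    selected window strictly contains the center of another one."""
    h, w = len(img), len(img[0])
    ii = integral_image(img)
    scored = [(window_sum(ii, t, l, win_h, win_w), t, l)
              for t in range(h - win_h + 1)
              for l in range(w - win_w + 1)]
    scored.sort(key=lambda s: -s[0])
    chosen = []
    for score, t, l in scored:
        # For equal-sized windows, "A contains the center of B" is symmetric,
        # so comparing the top-left offsets is sufficient.
        clash = any(abs(t - t2) < win_h / 2.0 and abs(l - l2) < win_w / 2.0
                    for _, t2, l2 in chosen)
        if not clash:
            chosen.append((score, t, l))
            if len(chosen) == k:
                break
    return chosen

# Toy hand likelihood image with two bright blobs (hypothetical data).
img = [[0.0] * 6 for _ in range(6)]
for y in (0, 1):
    for x in (0, 1):
        img[y][x] = 5.0
for y in (4, 5):
    for x in (4, 5):
        img[y][x] = 3.0
candidates = top_k_windows(img, 2, 2, 2)
```

Each returned tuple holds the window score and its top-left corner, from which the region centroid used as the position feature can be derived.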

A  distinguishing  feature  of  our  hand  detection  algorithm 
compared  to  most  existing  methods  [2]  is  that  we  do  not  use 
connected  component  analysis  to  find  the  largest  component 
(discounting  the  face),  and  associate  it  with  the  gesturing 
hand.  The  connected  component  algorithm  may  group  the 
hand  with  the  arm  (if  the  user  is  wearing  a  shirt  with  short 
sleeves),  or  with  the  face,  or  with  any  other  skin-colored  ob¬ 
jects  with  which  the  hand  may  overlap.  As  a  result  the  hand 
location,  which  is  typically  represented  by  the  largest  com¬ 
ponent’s  centroid,  will  be  incorrectly  estimated.  In  contrast, 
our hand detection algorithm maintains for every frame of
the sequence multiple subwindows, some of which may
occupy different parts of the same connected component. The
gesturing hand is typically covered by one or more of these
subwindows (see Figure 1).

3.2.  Feature  Extraction 

For every frame j of the query sequence, K candidate hand
regions are found as described in the previous section. For
every candidate k in frame j a 4D feature vector
$Q_{jk} = (x_{jk}, y_{jk}, u_{jk}, v_{jk})$ is extracted. The 2D
position (x, y) is the region centroid, and the 2D velocity
(u, v) is the optical flow averaged over that region. Optical
flow is computed using a block-based matching method [24].

In  our  current  implementation,  when  we  collect  the 
model  sequences,  a  colored  glove  is  used  to  reliably  detect 
the  gesturing  hand.  Using  such  additional  constraints,  like 
colored  markers,  is  often  desirable  for  the  offline  model¬ 
building  phase,  because  it  simplifies  the  construction  of  ac¬ 
curate  class  models.  It  is  important  to  stress  that  such  mark¬ 
ers  are  not  used  in  the  query  sequences,  and  therefore  they 
do  not  affect  the  naturalness  and  comfort  of  the  user  inter¬ 
face,  as  perceived  by  the  end  user  of  the  system. 


Based  on  the  features  extracted  from  all  database  se¬ 
quences,  we  compute  parameters  that  translate  and  scale 
those  features,  so  that  they  lie  inside  the  unit  hypercube. 
We  transform  the  feature  vectors  of  each  query  frame  using 
the  same  parameters. 
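
One simple way to realize this normalization is per-dimension min-max scaling, sketched below in Python. The choice of an independent translation and scale for each feature dimension, and the names `fit_scaling` and `apply_scaling`, are assumptions; the text only requires that the database features end up inside the unit hypercube and that queries reuse the same parameters.

```python
def fit_scaling(database_vectors):
    """Per-dimension minimum and maximum over all database feature vectors."""
    dims = len(database_vectors[0])
    lows = [min(v[d] for v in database_vectors) for d in range(dims)]
    highs = [max(v[d] for v in database_vectors) for d in range(dims)]
    return lows, highs

def apply_scaling(vector, lows, highs):
    """Translate and scale one feature vector with the database parameters."""
    out = []
    for x, lo, hi in zip(vector, lows, highs):
        span = hi - lo
        # A constant dimension carries no information; map it to 0.
        out.append((x - lo) / span if span > 0 else 0.0)
    return out

# Hypothetical (x, y, u, v) database features.
database = [(0.0, 10.0, -1.0, 2.0), (4.0, 20.0, 1.0, 6.0)]
lows, highs = fit_scaling(database)
scaled_query = apply_scaling((2.0, 15.0, 0.0, 4.0), lows, highs)
```

Note that query features transformed this way are not guaranteed to stay inside the unit hypercube, since the parameters come from the database sequences only.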

4.  Dynamic Space-Time Warping

One of several publications that describe the DTW algorithm
is [11]. In this section we describe dynamic space-time
warping, which is an extension of DTW that can handle
multiple candidate detections in each frame of the query.

Let $M = (M_1, \ldots, M_m)$ be a model sequence in which
each $M_i$ is a feature vector. Let $Q = (Q_1, \ldots, Q_n)$ be
a query sequence. In the regular DTW framework, each
$Q_j$ would be a feature vector, of the same form as each
$M_i$. However, in dynamic space-time warping (DSTW),
we want to model the fact that we have multiple candidate
feature vectors in each frame of the query. For example,
if the feature vector consists of the position and velocity of
the hand in each frame, and we have multiple hypotheses for
hand location, each of those hypotheses defines a different
feature vector. Therefore, in our algorithm, $Q_j$ is a set of
feature vectors: $Q_j = \{Q_{j1}, \ldots, Q_{jK}\}$, where each
$Q_{jk}$, for $k \in \{1, \ldots, K\}$, is a candidate feature
vector. K is the number of feature vectors extracted from each
query frame. In our algorithm we assume K is fixed, but in
principle K may vary from frame to frame.

A warping path W defines an alignment between M and
Q. Formally, $W = w_1, \ldots, w_T$, where
$\max(m, n) \le T \le m + n - 1$. Each $w_t = (i, j, k)$ is a
triple, which specifies that feature vector $M_i$ of the model
is matched with feature vector $Q_{jk}$. We say that $w_t$ has
two temporal dimensions (denoted by i and j) and one spatial
dimension (denoted by k). The warping path is typically
subject to several constraints (adapted from [11] to fit the
DSTW framework):

•  Boundary conditions: $w_1 = (1, 1, k)$ and
$w_T = (m, n, k')$. This requires the warping path to start by
matching the first frame of the model with the first
frame of the query, and end by matching the last frame
of the model with the last frame of the query. No
restrictions are placed on $k$ and $k'$, which can take any
value from 1 to K.

•  Temporal continuity: Given $w_t = (a, b, k)$ then
$w_{t-1} = (a', b', k')$, where $a - a' \le 1$ and
$b - b' \le 1$. This restricts the allowable steps in the
warping path to adjacent cells along the two temporal
dimensions.

•  Temporal monotonicity: Given $w_t = (a, b, k)$ then
$w_{t-1} = (a', b', k')$, where $a - a' \ge 0$ and
$b - b' \ge 0$. This forces the warping path sequence to
increase monotonically in the two temporal dimensions.


input  :  A sequence of model feature vectors $M_i$,
$1 \le i \le m$, and a sequence of sets of query feature
vectors $Q_j = \{Q_{j1}, \ldots, Q_{jK}\}$, $1 \le j \le n$.

output :  A global matching cost $D^*$, and an optimal
warping path $W^* = (w^*_1, \ldots, w^*_T)$.

//  Initialization
j = 0
for i = 0 : m do
    for k = 1 : K do
        $D(i, j, k) = \infty$
    end
end
$D(0, 0, 1) = 0$

//  Iteration
for j = 1 : n do
    for i = 0 : m do
        for k = 1 : K do
            if i = 0 then
                $D(i, j, k) = \infty$
            else
                $w = (i, j, k)$
                for $w' \in N(w)$ do
                    $C(w', w) = \tau(w', w) + D(w')$
                end
                $D(w) = d(w) + \min_{w' \in N(w)} C(w', w)$
                $b(w) = \arg\min_{w' \in N(w)} C(w', w)$
            end
        end
    end
end

//  Termination
$k^* = \arg\min_k \{D(m, n, k)\}$
$D^* = D(m, n, k^*)$
$w^*_T = (m, n, k^*)$

//  Backtrack
$w^*_{t-1} = b(w^*_t)$

Algorithm 1: The DSTW algorithm

Note  that  continuity  and  monotonicity  are  required  only 
in  the  temporal  dimensions.  No  such  restrictions  are  needed 
for  the  spatial  dimension;  the  warping  path  can  “jump”  from 
any  spatial  candidate  k  to  any  other  spatial  candidate  k'. 

Given warping path element $w_t = (i, j, k)$, we define the
set $N(i, j, k)$ to be the set of all possible values of
$w_{t-1}$ that satisfy the warping path constraints (in
particular continuity and monotonicity):

$$N(i, j, k) = \{(i-1, j), (i, j-1), (i-1, j-1)\} \times \{1, \ldots, K\} \qquad (1)$$

We assume that we have a cost measure
$d(i, j, k) = d(M_i, Q_{jk})$ between two feature vectors
$M_i$ and $Q_{jk}$. We also assume that we have a transition
cost $\tau(w_{t-1}, w_t)$ between two successive warping path
elements. DSTW finds the optimal path $W^*$ and the global
matching score $D^*$ as described in Algorithm 1.

For this algorithm to function correctly it is required that
$\tau(w', w) = 0$ when $w' = (0, j, k)$ or $w' = (i, 0, k)$. For
all other values of $w'$, $\tau$ must be defined appropriately in
a domain-specific way. The function $\tau$ plays a similar role
in DSTW as state transition probabilities play in the HMM
framework.
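
The recurrence and backtracking of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' Matlab implementation: the function name `dstw`, the dictionary used for back-pointers, and the toy one-dimensional features are assumptions. Indices i and j are 1-based as in the text, while the candidate index k is a 0-based list index. The local cost `d` and transition cost `tau` are supplied by the caller; the example uses a zero transition cost, which is only one possible choice.

```python
import math

def dstw(model, query, d, tau):
    """model: list of m feature vectors; query: list of n frames, each a
    list of K candidate feature vectors. Returns (D*, warping path)."""
    m, n, K = len(model), len(query), len(query[0])
    INF = math.inf
    # D[i][j][k] for i in 0..m, j in 0..n, k in 0..K-1; row/column 0 are virtual.
    D = [[[INF] * K for _ in range(n + 1)] for _ in range(m + 1)]
    back = {}
    D[0][0][0] = 0.0  # single virtual start cell
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            for k in range(K):
                w = (i, j, k)
                best, best_prev = INF, None
                # Predecessors N(w): three temporal steps times K candidates.
                for pi, pj in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
                    for pk in range(K):
                        prev_cost = D[pi][pj][pk]
                        if prev_cost == INF:
                            continue
                        c = tau((pi, pj, pk), w) + prev_cost
                        if c < best:
                            best, best_prev = c, (pi, pj, pk)
                if best_prev is not None:
                    D[i][j][k] = d(model[i - 1], query[j - 1][k]) + best
                    back[w] = best_prev
    # Termination: best candidate in the last cell, then backtrack.
    k_star = min(range(K), key=lambda k: D[m][n][k])
    cost = D[m][n][k_star]
    path = [(m, n, k_star)]
    while path[-1] in back:
        prev = back[path[-1]]
        if prev[0] == 0 or prev[1] == 0:  # reached the virtual start cell
            break
        path.append(prev)
    path.reverse()
    return cost, path

# Toy example: 1D positions, K = 2 candidates per query frame.
model = [0.0, 1.0, 2.0]
query = [[5.0, 0.1], [1.0, 7.0], [9.0, 2.1]]
cost, path = dstw(model, query,
                  d=lambda a, b: abs(a - b),
                  tau=lambda w_prev, w: 0.0)
```

In this toy run the path selects the candidate closest to the model in each frame, which is the simultaneous localization and alignment behavior described above. Because every cell examines all 3K predecessors, this direct transcription does more work per cell than an optimized implementation would need when the transition cost has special structure.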

4.1.  Translation  Invariance 

In  recognizing  hand  gestures,  a  commonly  used  feature  is 
position  of  the  hand.  Using  positions  as  features  is  appeal¬ 
ing  because  they  are  simple  to  extract,  and  are  highly  in¬ 
formative  about  the  gesture  content.  However,  they  are  not 
invariant  to  translation.  In  simple  DTW,  there  is  a  single 
candidate  per  frame,  and  therefore  the  trajectory  of  the  hand 
is  known.  In  that  case  we  can  achieve  translation  invari¬ 
ance  (i.e.,  invariance  with  respect  to  global  translation  of 
the  entire  gesture)  by  subtracting  from  the  entire  trajectory 
the  position  of  the  hand  in  the  first  frame. 

In  DSTW,  we  can  apply  this  strategy  to  the  model,  where 
the  trajectory  is  known  (recall  that  a  colored  glove  is  used  in 
the  model  sequences).  However,  for  the  test  sequence,  there 
are  multiple  candidates  at  each  frame  and  therefore  the  po¬ 
sition  of  the  hand  in  the  first  frame  is  not  known.  When 
position  is  used  as  a  feature,  we  can  achieve  translation  in¬ 
variance  as  follows:  given  the  K  candidate  regions  detected 
at  the  first  frame,  we  start  K  separate  DSTW  processes, 
running  in  parallel.  Each  such  process  corresponds  to  a 
candidate  k  among  the  K  regions  detected  in  the  first  frame. 
The  process  makes  the  assumption  that  k  was  the  cor¬ 
rect  candidate  in  the  first  frame,  and  normalizes  all  position 
features  in  subsequent  frames  by  subtracting  from  them  the 
position of the k-th candidate in the first frame. Note that
this  normalization  is  only  applied  to  the  position  compo¬ 
nent  of  the  feature  vector.  The  velocity  features  used  in  the 
experiments  are  translation-invariant  by  definition.  When 
all  frames  have  been  processed,  to  find  the  best  match  of 
the  observation  sequence  with  the  model,  we  need  to  find 
which  of  the  K  parallel  DSTW  processes  gave  the  lowest 
matching  cost  D*. 

DSTW takes $O(Kmn)$ time; the translation-invariant
version takes $O(K^2 mn)$. Overall, adding translation
invariance increases both the space and the time complexity
of the algorithm by a factor of K.
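
The K parallel processes can be sketched as a loop over the first-frame candidates. The Python sketch below is an illustration under simplifying assumptions: the transition cost is taken to be zero, in which case the minimum over candidates can be folded into the local cost and only the matching cost is computed (no warping path); the processes run sequentially rather than in parallel; and the path is not forced to pass through the assumed first-frame candidate. The names `dstw_cost`, `shift`, and `translation_invariant_dstw` are hypothetical.

```python
import math

def dstw_cost(model, query, d):
    """Cost-only DSTW with zero transition cost (simplifying assumption).
    best[i][j] holds the minimum over k of D(i, j, k)."""
    m, n = len(model), len(query)
    INF = math.inf
    best = [[INF] * (n + 1) for _ in range(m + 1)]
    best[0][0] = 0.0
    for j in range(1, n + 1):
        for i in range(1, m + 1):
            prev = min(best[i - 1][j], best[i][j - 1], best[i - 1][j - 1])
            local = min(d(model[i - 1], q) for q in query[j - 1])
            best[i][j] = local + prev
    return best[m][n]

def shift(vec, origin):
    """Subtract origin from the position (x, y); velocity (u, v) is unchanged."""
    x, y, u, v = vec
    return (x - origin[0], y - origin[1], u, v)

def translation_invariant_dstw(model, query, d):
    """Run one DSTW process per first-frame candidate; return (cost, k0)."""
    norm_model = [shift(v, model[0][:2]) for v in model]
    results = []
    for k0, cand in enumerate(query[0]):
        origin = cand[:2]  # assume candidate k0 is the hand in frame 1
        norm_query = [[shift(q, origin) for q in frame] for frame in query]
        results.append((dstw_cost(norm_model, norm_query, d), k0))
    return min(results)

# Toy example: the gesture is shifted by (+50, +20); candidate 0 is a
# static distractor, candidate 1 is the gesturing hand (hypothetical data).
model = [(10, 10, 1, 0), (11, 10, 1, 0), (12, 10, 1, 0)]
query = [[(5, 5, 0, 0), (60, 30, 1, 0)],
         [(5, 5, 0, 0), (61, 30, 1, 0)],
         [(5, 5, 0, 0), (62, 30, 1, 0)]]
cost, k0 = translation_invariant_dstw(model, query, math.dist)
```

Here the process that normalizes by the true hand candidate reproduces the model trajectory exactly, so it yields the lowest matching cost despite the global shift.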

5.  Experiments 

To  test  the  DSTW  algorithm  we  implemented  a  hand-signed 
digit  recognition  system  in  Matlab.  For  the  experiments  we 
have  collected  video  clips  of  three  users  gesturing  the  ten 


digits in the style of Palm’s Graffiti Alphabet [14] (Figure
2). The video clips were captured with a Logitech 3000 Pro
camera using an image resolution of 240 × 320. A total
of  270  digit  exemplars  were  extracted  from  three  different 
types  of  video  clips  depending  on  what  the  user  wore: 

•  Colored  Gloves:  30  digit  exemplars  per  user  were 
stored  in  the  database  (See  Figure  3). 

•  Long  Sleeves:  30  digit  exemplars  per  user  were  used 
as  queries. 

•  Short  Sleeves:  30  digit  exemplars  per  user  were  used 
as  queries. 

Given  a  query  frame,  K  candidate  hand  regions  of  size 
40  x  30  were  detected  as  described  in  Section  3.1.  For  ev¬ 
ery  candidate  hand  region  in  every  query  frame,  a  feature 
vector  was  extracted  and  normalized  as  described  in  Sec¬ 
tion  3.2.  The  query  digit  was  then  matched  with  the  model 
exemplars  in  the  database:  for  the  user-dependent  experi¬ 
ments,  30  query  digits  of  one  user  were  matched  with  30 
database  digits  of  the  same  user;  for  the  user-independent 
experiments,  30  query  digits  of  one  user  were  matched  with 
all  60  database  digits  of  the  two  other  users.  The  class  of  the 
query  was  estimated  using  the  one  nearest  neighbor  (1-NN) 
rule,  and  classification  accuracy  rates  were  averaged  over 
the  three  users.  Examples  of  a  correct  match  and  a  false 
match  are  shown  in  Figures  6  and  7  respectively. 
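
The 1-NN decision rule itself is independent of how the matching cost is computed. A minimal Python sketch is given below; `classify_1nn` and `toy_cost` are hypothetical names, and `toy_cost` is only a stand-in for the DSTW matching cost $D^*$ so that the example runs on its own.

```python
def classify_1nn(query, database, matching_cost):
    """database: list of (label, model) pairs. Returns the label and cost
    of the model with the lowest matching cost to the query."""
    best_label, best_cost = None, float("inf")
    for label, model in database:
        cost = matching_cost(model, query)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label, best_cost

def toy_cost(model, query):
    """Stand-in for D*: sum of absolute differences (illustration only)."""
    return sum(abs(a - b) for a, b in zip(model, query))

# Hypothetical database with one exemplar of "1" and two of "7".
database = [("1", [0, 0, 0]), ("7", [0, 1, 2]), ("7", [0, 2, 4])]
label, cost = classify_1nn([0, 1, 3], database, toy_cost)
```

In the experiments described above, the database would hold the glove-based model exemplars and the cost function would be the DSTW matching cost.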

5.1.  Experiment  1:  DSTW  vs.  DTW 

The  purpose  of  the  first  experiment  is  to  demonstrate  that 
the  DSTW  algorithm  outperforms  the  simple  DTW  algo¬ 
rithm  when  using  a  hand  detection  method  based  on  color 
and  motion  [2].  The  classification  rates  depicted  in  Table 
1 show a significant increase in classification accuracy
(11.1 to 21.1 percentage points) between the simple DTW
algorithm, which
can  only  handle  a  single  (best)  candidate,  and  the  proposed 
DSTW  algorithm,  which  can  handle  multiple  candidates.  In 
addition,  the  graphs  in  Figure  4  show  the  initial  decreasing 
trend  of  the  classification  error  rate  as  K  increases.  At  some 
point  the  error  rate  stops  decreasing  since  additional  candi¬ 
dates  cause  more  false  matches.  The  optimal  value  for  K 
can  be  estimated  using  cross  validation. 

The  results  in  Table  1  also  show  that  the  classification 
accuracy  rates  for  the  short  sleeves  sequences  are  slightly 
worse  than  the  classification  accuracy  for  the  long  sleeves 
sequences.  This  is  to  be  expected,  because  the  gesturing 
hand  is  more  accurately  localized  when  the  user  wears  a 
long  sleeved  shirt.  However,  it  is  important  to  note  that 
the  classification  accuracy  for  the  short  sleeves  sequences 
would  be  much  worse  without  handling  multiple  candidate 
observations,  unless  much  more  sophisticated  hand  seg¬ 
mentation  and  detection  algorithms  were  employed. 


Figure  2:  Palm’s  Graffiti  digits. 


Figure  3:  Example  model  digits  extracted  using  a  colored  glove. 


Experiment       User-dep.        User-indep.
Method           DTW     DSTW     DTW     DSTW
Long Sleeves     81.1    96.7     76.7    91.1
Short Sleeves    82.2    93.3     70.0    91.1

Table 1: Classification accuracy results (%). The results for
DSTW are for K = 8.


Figure  4:  Classification  error  as  a  function  of  the  number  of  can¬ 
didates  for  the  experiment  with  short  sleeves  sequences.  The  ex¬ 
periment  with  long  sleeves  sequences  showed  similar  behavior. 


5.2.  Experiment  2:  Translation  Invariance 

The  purpose  of  the  second  experiment  is  to  demonstrate  the 
additional  benefit  of  incorporating  translation  invariance  in 
the  DSTW  framework  as  proposed  in  Section  4.1.  In  princi¬ 
ple,  translation  invariance  could  be  obtained  using  transla¬ 
tion  invariant  features  such  as  relative  position  with  respect 
to  the  hand  position  in  the  first  frame,  or  velocity.  However, 
in  practice  the  hand  position  in  the  first  frame  is  not  known, 
and  using  only  velocity  as  a  feature  causes  dramatic  drops  in 
classification  accuracy.  For  example,  accuracy  drops  from 
85.6%  to  22.2%  in  the  user-independent  experiment  with 
short  sleeves  sequences  using  K  =  12. 

On  the  other  hand,  if  both  absolute  position  and  velocity 
are  included  in  the  feature  vector  and  translation  invariance 
is  not  handled,  then  if  there  is  a  shift  of  the  gesture  in  the 
image  plane,  the  classification  error  rate  will  increase.  The 
graph  in  Figure  5  shows  the  increase  of  the  error  rate  as 
a  function  of  (a  synthetic)  increase  in  translation,  for  the 
user-independent  experiment  with  short  sleeves  sequences. 
Clearly,  for  the  translation  invariant  formulation,  the  error 
rate  does  not  change  as  translation  increases,  as  indicated 
by  the  horizontal  line  in  the  graph.  We  also  note  that  the 
reason  that  the  error  rates  when  using  absolute  position  and 
velocity  are  relatively  low  for  small  translation  is  that  all 
gestures  were  performed  approximately  at  the  same  location 
in  the  image  plane. 


Figure 5: Classification error as a function of translation for
the user-independent experiment with the short sleeves
sequences and K = 8. The experiment with long sleeves
sequences showed similar behavior.




Figure  6:  Example  query  trajectory  (left)  and  corresponding 
model  trajectory  (right)  for  a  correct  match  between  two  users 
signing  the  digit  9. 




Figure  7:  Example  confusion  between  query  digit  3  (left)  and 
model  digit  7  (right).  In  the  final  segment  of  the  query  digit  3 
the  elbow  rather  than  the  hand  is  falsely  matched  with  the  hand  of 
model  digit  7. 


6.  Conclusion  and  Future  Work 

Dynamic space-time warping (DSTW) is a general method
for matching time series that can accommodate multiple
candidate feature vectors at each time step. In this paper
DSTW  has  been  applied  to  the  simultaneous  localization 
and  recognition  of  dynamic  hand  gestures.  The  algorithm 
can  recognize  gestures  using  a  fairly  simple  hand  detection 
module  that  yields  multiple  candidates.  The  system  does 
not  break  down  in  the  presence  of  a  cluttered  background, 
multiple  moving  objects,  multiple  skin-colored  image  re¬ 
gions,  and  users  wearing  short  sleeves  shirts. 

Incorporation  of  translation  invariance  further  increases 
the  system  flexibility  by  allowing  the  user  to  gesture  in  any 
part  of  the  image.  Scale  invariance  may  be  obtained  through 
the  commonly  used  image  pyramid  method,  or  alternatively 
by  detecting  certain  body  parts,  like  the  head  and  shoulders, 
and  measuring  their  size.  Implementing  scale  invariance 
remains  a  topic  for  future  investigation. 

Another  aspect  of  the  problem  that  has  not  been  ad¬ 
dressed  so  far  is  temporal  segmentation.  In  our  experiments, 
the  system  knew  the  starting  and  ending  frame  of  each  ges¬ 
ture.  In  a  real  application,  the  user  could  indicate  the  start 
and  end  of  a  gesture,  for  example  by  using  a  distinct  pose 
for  the  non-gesturing  hand  [8],  or  by  pressing  a  key. 

Our  current  implementation  uses  very  simple  features, 
both  for  hand  detection  and  for  DSTW  matching.  We  ex¬ 
pect  accuracy  to  improve  as  we  include  more  expressive 
features,  such  as  appearance-based  features  like  edges,  ori¬ 
entation  histograms  [8],  or  optical  flow  correlation  [6]  for 
recognition.  We  are  currently  working  on  efficient  and  ac¬ 
curate  methods  for  combining  such  features  within  a  dy¬ 
namic  framework. 